Running Llama Aligned DeepSeek R1 on your Apple Silicon Mac using Apple Neural Engine

This article shows you how to run the Llama Aligned DeepSeek R1 Distil models on your own computer using the Apple Neural Engine. If you want to use just your CPU, or any hardware other than an Apple Silicon Mac, see the other versions of this article here.

KoboldCPP is an excellent open source program that provides you with a graphical user interface for interacting with LLMs; for the developers among you, there is also an API that uses the familiar OpenAI format.
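For developers, once KoboldCPP is running with a model loaded, its OpenAI-format endpoint is typically reachable at the same local address as the web interface. The sketch below, using only the Python standard library, shows one way you might call it; the URL, port, and model name here are assumptions, so check the address KoboldCPP prints at startup.

```python
import json
import urllib.request

# Assumed endpoint: KoboldCPP's web UI defaults to port 5001, and it serves
# an OpenAI-compatible chat completions route under /v1/.
API_URL = "http://localhost:5001/v1/chat/completions"

def build_chat_request(prompt, max_tokens=2000):
    """Build an OpenAI-format chat completion payload."""
    return {
        "model": "koboldcpp",  # KoboldCPP answers for whichever model it loaded
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt):
    """Send the prompt to a running KoboldCPP instance and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request body follows the OpenAI format, existing OpenAI client libraries can usually be pointed at this local URL instead.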

Blue Beck’s Llama Aligned DeepSeek R1 Distil models are fine-tuned versions of DeepSeek AI’s 8B and 70B R1-Distil models, culturally aligned with Meta’s Llama 3 series to make them more suitable for public-facing deployment in Western countries, while retaining the reasoning capabilities that R1-based models are famous for.

Start by downloading the latest KoboldCPP for your operating system from the downloads page here; the version you need is “koboldcpp-mac-arm64”.

Most people should start with the 4-bit 8B version of the model, which you can download from here. Once you've tried this and have it working, there is more information in the notes at the bottom of the article for those of you with enough memory to try the other versions.

Once you have downloaded the model and KoboldCPP, it is time to get started.

You should satisfy yourself that the KoboldCPP executable you downloaded from GitHub is safe (i.e. free from malware); here is one way to do this. Deciding to run the executable on your computer is your own responsibility, and we accept no liability for any consequences. Depending on your system, you may need to grant KoboldCPP permission to run in your operating system.
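One common safety check is to compare the downloaded file's SHA-256 hash against a trusted checksum, if the project publishes one (macOS also has the built-in `shasum -a 256` command for this). A minimal sketch in Python:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks
    so even multi-gigabyte downloads don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the printed digest against a checksum from a source you trust:
# print(sha256_of("koboldcpp-mac-arm64"))
```

A matching hash confirms the file wasn't corrupted or tampered with in transit; it does not by itself prove the original release is benign, so also consider the reputation of the source.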

When KoboldCPP starts, you will see a menu like this.

KoboldCPP Menu

Click the “Browse” button under where it says “GGUF Text Model”, then select the model file you downloaded, i.e. LlamaAligned-DeepSeekR1-Distil-8b.Q4_K_M.gguf

You can now click Launch in the bottom right-hand corner. This should open your web browser at a local URL (something like http://localhost:5001/#) showing the interface for chatting with the LLM.

The first thing you need to do is to click the “Settings” button on the top bar.

KoboldCPP Menu

This will open a window like the one below. Where it says “Usage Mode”, select “Instruct Mode” from the drop-down menu, and where it says “Instruct Tag Preset”, select “Deepseek V2.5”.

KoboldCPP Menu

Now change tabs by clicking “Samplers” just under the heading of the settings window.

KoboldCPP Menu

Where it says “Context Size”, set the number to 8000 (or higher), and where it says “Max Output”, set it to 2000 (or higher). You will need to change these numbers by clicking on them and typing rather than using the slider bar, as DeepSeek R1 based models won’t work well with the smaller range the sliders allow (see the notes at the bottom of the article for an explanation).

Once you’ve done this press OK to go back to the main interface.

Everything is now ready and you can go ahead and ask a question, just type in the text box at the bottom and click the send button like in most chat user interfaces.

KoboldCPP Menu

For questions where the model uses an element of reasoning, you will see it output a <think> tag followed by its chain of thought before it arrives at its final answer.

KoboldCPP Menu

If it stops while it’s still in this process, just press the send button again without typing anything. Once the model has arrived at its final answer, you should see output formatted like this.

KoboldCPP Menu

If you want to ask further questions that relate to your previous question, just type them like in any other chat interface. However, if you have a new, unrelated question, everything will be faster and use less memory if you first click “New Session” in the top left (just press OK when asked “Really Start A New Session?”).



Additional Notes:
1) The reason for setting “Context Size” and “Max Output” to relatively large numbers is that reasoning models such as those based on DeepSeek R1 output a lengthy chain of thought before answering the question. “Context Size” is effectively the maximum size of the whole conversation, and “Max Output” is how much output the model is allowed to generate before KoboldCPP forces it to stop. The default values (and even the maximum values the sliders allow) are too small to accommodate the full chain-of-thought output.
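The interaction between the two settings can be sketched as a simple budget: the next reply is capped by “Max Output”, but also by whatever room is left in the context window. (The token counts below are hypothetical, purely for illustration.)

```python
def remaining_output_budget(context_size, history_tokens, max_output):
    """Tokens the model can actually generate next: capped both by
    Max Output and by the space left in the context window."""
    return min(max_output, context_size - history_tokens)

# Hypothetical numbers: 500 tokens of conversation so far, and a chain of
# thought that may run well past 1000 tokens before the final answer.
# With the recommended settings there is ample headroom:
print(remaining_output_budget(8000, 500, 2000))   # prints 2000
```

If either number is left at a small default, the budget shrinks and the model can be cut off mid-reasoning, which is the stalling behaviour described earlier.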

2) If you have more free RAM, you may wish to try the 8-bit 8B version of the model, which can give improved results at a cost of approximately 4 GB of extra memory usage. This can be downloaded from here.

3) If you have plenty of RAM (48GB+) you can try the 70B model, note that this will be a lot slower than the 8b model and will run best on a computer with a lot of memory bandwidth such as those with the "Max" versions of the Apple Silicon processors. This 70B model is a really powerful model on a different level to the 8B version, far closer in output quality to the largest R1 model or other reasoning models you may have used in the cloud. Download the 4bit version from here, or if you’ve got even more RAM (80GB+) you can try the 8bit 70B version, which is downloaded in 2 parts (just open part 1 in KoboldCPP as long as part 2 is in the same folder), download part 1 here, and download part 2 here.